s3parquet destination + OOM tuning (final stack layer) - fixes 3 pre-existing test failures #29
Merged
New `-dest s3parquet:<endpoint>` destination accumulates ProtobufList
Envelopes into in-memory Parquet builders, finalizes at a configurable
byte threshold (default 63 MiB), and PUTs the object via minio-go to
S3-compatible storage. Hive-partitioned object keys
(host=…/date=…/hour=…) keep `s3()` table-function consumers and
DuckDB pruning cheap.
Architecture:
- Async worker owns the parquet-go GenericWriter and minio client;
Send only marshals the bytes into a bounded queue (16 slots) and
returns, so the Poller never stalls on uploads. Queue-full bumps
a Prom counter and falls through to a blocking send (back-pressure
visible without data loss).
- Hand-written ParquetRow struct mirrors XtcpFlatRecord with
parquet: snake_case + per-column codecs (ZSTD for strings/bytes,
SNAPPY for numeric/timestamp). A drift test reflects the proto
FileDescriptorSet at unit-test time and fails CI if columns
diverge.
- Object keys sanitized for path traversal / NUL / control chars
before they touch the S3 PutObject call; a hacker-attacker
test suite asserts ../../etc/passwd-style hostnames cannot escape
the prefix and S3 secret keys never appear in error paths.
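The queue behaviour in the first bullet (try a non-blocking enqueue, count the queue-full event, then fall through to a blocking send) can be sketched as below. This is an illustration under assumptions: `sender`, `newSender`, and the atomic counter are hypothetical stand-ins for the real worker and its Prometheus counter.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sender is a hypothetical stand-in for the s3parquet destination's front end.
type sender struct {
	queue     chan []byte
	queueFull atomic.Int64 // stands in for the Prometheus counter
}

func newSender(slots int) *sender {
	return &sender{queue: make(chan []byte, slots)}
}

// Send tries a non-blocking enqueue first. If the queue is full it counts the
// event and then blocks until the worker frees a slot, so nothing is dropped.
func (s *sender) Send(b []byte) {
	select {
	case s.queue <- b:
	default:
		s.queueFull.Add(1)
		s.queue <- b
	}
}

func main() {
	s := newSender(16)
	done := make(chan struct{})
	go func() { // stand-in for the async upload worker
		for range s.queue {
		}
		close(done)
	}()
	for i := 0; i < 1000; i++ {
		s.Send([]byte("envelope"))
	}
	close(s.queue)
	<-done
	fmt.Println("queue-full events:", s.queueFull.Load())
}
```

The counter makes back-pressure observable, and the blocking fallback keeps the no-data-loss property the bullet describes.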
CLI / config: -s3Endpoint, -s3Bucket, -s3Prefix, -s3AccessKey,
-s3SecretKey, -s3Region, -s3ParquetFlushBytes (+ S3_* env overrides;
secrets logged as "set" only). Proto fields s3_endpoint=125 …
s3_region=133 (130/131 skipped to avoid the existing `dest` slot).
Vector retired in the same commit:
- vector-pipeline.nix, xtcp2-vector-path.nix, self-test-vector.nix
deleted; vector branches in mkVm.nix / microvms/default.nix /
nix/default.nix removed (isVector, vectorModules, xtcp2VectorArgs,
vmsVector, lifecycleVector, checksVector, microvm-x86_64-vector,
microvm-x86_64-lifecycle-vector). Vector was misconfigured for the
ProtobufList envelope wire format and wrote JSON, not Parquet —
s3parquet supersedes its intended role with one fewer process and
no descriptor-set mount.
- mkProtoDescSet helper and the `xtcp-flat-record-desc` package
remain exposed for external consumers that still want the .desc
artifact.
Microvm flavor `s3parquet` (sink="s3parquet") reuses the existing
minio-bucket-bootstrap module. Lifecycle self-test adds two sentinels:
S3PARQUET_FILES_PASS (≥1 .parquet object in MinIO within 90 s) and
S3PARQUET_ROWS_PASS (DuckDB row count ≥1 from the produced object).
Both pass in CI with 1204 rows landed after a 60 s boot.
Test coverage in all six categories (positive/negative/boundary/corner/
adversarial/hacker-attacker) plus Benchmarks and a concurrent
sends+close race test under `-race`.
Validator change: schemeS3Parquet joins the path-style exemption in
input_validation.go because the endpoint URL (http://host:port) contains
its own colons; the strict two-colon rule still applies to kafka/nats/nsq/
valkey/udp.
Vendor hash + allLibraryDestinations updated for the new minio-go +
parquet-go deps.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…te heartbeats
Adds a new microvm flavor `s3parquet-long` paired with `mkS3ParquetRunner`
(`nix run .#microvm-x86_64-s3parquet-runner -- --duration <5m|12h|…>`).
Mirrors the existing soak/tcp-stress runner pattern: boots the VM, sleeps
for the requested duration, prints a heartbeat every 30 s (short runs) or
5 min (long runs), then powers off with a markdown-style summary table
of per-sentinel file deltas.
Flavor mechanics:
- `sink = "s3parquet-long"` reuses the existing minio-bucket-bootstrap
module, the s3parquet destination, and the soak nsTest/tcp_server/
tcp_client traffic generators (so xtcp2 always has a populated
netlink readout to feed the parquet writer). 1 MiB flush threshold
keeps the file count visible at short durations; for production-size
testing, omit the flag in xtcp2S3ParquetLongArgs to get the 63 MiB
default, or set it explicitly (67108864 bytes is 64 MiB; 63 MiB is
66060288).
- Self-test is skipped (`!isSoak && !isS3ParquetLong`); a new
systemd unit `xtcp2-s3parquet-monitor.service` emits one sentinel
line per `S3PARQUET_REPORT_INTERVAL` seconds (default 60 s):
XTCP2_S3PARQUET_HOURLY <ts> files=<n> bytes=<n> rows=<n>
- The monitor sources its numbers from xtcp2's own Prometheus
counters (`destS3Parquet/upload`, `uploadBytes`, `uploadRows`)
via `curl /metrics`. An earlier `mc find` implementation was
too slow under nsTest load — Prometheus is authoritative and
~1 ms per scrape.
Runner mechanics:
- Reads heartbeat counts off the in-VM sentinels in the serial
transcript (host-side mc through the forwarded port doesn't
actually route in this microvm setup — qemu reports the port as
LISTEN but curl times out).
- `--report-interval` is honored only as a sanity check in the
summary's min-expected-reports math; the in-VM cadence is
baked at build time.
- `--rss-cap-mb` parameter wired but inactive (RSS scrape from
the host requires VM introspection we don't have); kept as a
hook for a follow-up.
- Summary: total files, total bytes, total rows, panics, restarts,
and the full per-sentinel delta table.
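A sketch of how a runner could read the `XTCP2_S3PARQUET_HOURLY` sentinels off the serial transcript and compute a per-heartbeat delta. The parser is hypothetical: it assumes the timestamp is a single whitespace-free token and that every counter is an integer `key=value` pair.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const sentinelPrefix = "XTCP2_S3PARQUET_HOURLY"

// parseSentinel extracts the timestamp token and the key=value counters from
// one transcript line. Lines that do not contain the sentinel return ok=false.
// Anything before the sentinel (console or journal prefixes) is ignored.
func parseSentinel(line string) (ts string, counters map[string]int64, ok bool) {
	i := strings.Index(line, sentinelPrefix)
	if i < 0 {
		return "", nil, false
	}
	fields := strings.Fields(line[i:])
	if len(fields) < 2 {
		return "", nil, false
	}
	counters = make(map[string]int64)
	for _, f := range fields[2:] {
		k, v, found := strings.Cut(f, "=")
		if !found {
			continue
		}
		n, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			continue // skip malformed values rather than failing the line
		}
		counters[k] = n
	}
	return fields[1], counters, true
}

func main() {
	_, prev, _ := parseSentinel("XTCP2_S3PARQUET_HOURLY 1700000000 files=40 bytes=41943040 rows=9000")
	_, cur, _ := parseSentinel("XTCP2_S3PARQUET_HOURLY 1700000060 files=52 bytes=54525952 rows=11700")
	fmt.Println("files delta:", cur["files"]-prev["files"])
}
```

Keeping the counters in a map means extra fields added to the sentinel later are picked up without changing the parser.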
Bucket-bootstrap module now binds MinIO to 0.0.0.0 instead of
127.0.0.1 so the (currently disabled) host-side forwarded-port path
would work if microvm.nix's hostfwd routing ever gets fixed. Inside
the VM nothing changes — xtcp2 still talks to MinIO via 127.0.0.1.
Phase B (5 m): 52 files PASS.
Phase C (30 m): 366 files PASS, steady ~12-14 files/min delta, zero
panics/restarts, in-VM memory stable.
Phase D (2 h at production 63 MiB) and Phase E (12 h, production-
shaped) remain user-triggered:
nix run .#microvm-x86_64-s3parquet-runner -- --duration 12h
The defaults give ~12 files/min at 1 MiB threshold; switch the
threshold in xtcp2S3ParquetLongArgs for production-size objects.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Before launching the 12 h soak. At the steady ~1 MB/min raw-row rate
observed in the 30 min smoke, a 12 h run produces ~12 finalized objects,
which matches the user's "multiple files after 12 hours" expectation and
exercises the production-sized object path the 1 MiB smoke threshold
doesn't.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a Pyroscope-go agent inside xtcp2 and an in-VM Pyroscope OSS
server so operators can stream and visualise CPU, alloc, in-use,
goroutine, mutex, and block profiles without a separate profiling
infrastructure. Motivated by the 12 h s3parquet-long soak hitting
`fatal error: thread exhaustion` at 1h 45min — a goroutine/thread
leak in the namespace-handler hot path that pprof one-shots
couldn't localize.
xtcp2 (Go):
- New deps: github.com/grafana/pyroscope-go + godeltaprof
- New CLI flags + proto fields (136-139):
-pyroscopeUrl (empty disables the agent)
-pyroscopeAppName ("xtcp2" by default)
-pyroscopeSampleHz (100 Hz default)
-pyroscopeUploadSec (15 s default)
All five profile types start when -pyroscopeUrl is non-empty.
Empty URL is zero-overhead — production runs that don't want the
agent simply leave the flag unset. Secrets aren't applicable
(Pyroscope endpoints are usually authenticated by network policy
or a sidecar; we don't ship credentials in argv).
NixOS module nix/modules/pyroscope-server.nix:
- Wraps services.pyroscope to run a single-binary all-in-one
server with filesystem-backed storage (no external S3/Azure
blob dependency). Listens on 0.0.0.0:14040 in the VM (4040 is
occupied by something else inside the NixOS boot; investigated
briefly then sidestepped — 14040 works cleanly).
- Drops DynamicUser → runs as root inside the disposable VM so
writes to /var/lib/pyroscope/blocks succeed without the
nixpkgs default's StateDirectory-vs-tmpfs choreography.
- Forces stderr/stdout onto journal+console so future startup
failures surface on the serial transcript (the default
journal-only logging hid the real "bind: address already in
use" diagnostic across three earlier debugging cycles).
Microvm wiring:
- s3parquet-long flavor imports pyroscope-server.nix and passes
-pyroscopeUrl http://127.0.0.1:14040 -pyroscopeAppName
xtcp2.s3parquet-long into xtcp2's extra args.
- Forwards host:14040 → guest:14040 so an operator can hit the
Pyroscope UI at http://127.0.0.1:14040 if QEMU hostfwd is
working (it intermittently isn't in this microvm setup, but
the agent still streams profile data inside the VM regardless).
- In-VM monitor now also emits go_goroutines + go_threads in
the XTCP2_S3PARQUET_HOURLY sentinel — a per-minute leak
indicator visible directly in the runner summary without
needing the Pyroscope UI.
Phase G validation: 30 min s3parquet-long soak PASS, 6 finalised
63 MiB parquet objects, 0 panics, 0 restarts, Pyroscope agent
shipping all five profile types every 15 s. Ready for the
follow-up 2+ hour leak-diagnosis run.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 12 h s3parquet-long soak hit `fatal error: thread exhaustion`
at 1 h 45 min — over 2000 OS threads accumulated. Pyroscope's live
goroutine profile (now reachable from the host via the firewall
fix in the same change) showed the leaking call site clearly:
50 goroutines @ ns_net_namespace.go:141 (<-nsCtx.Done())
33 goroutines @ ns_net_namespace.go:281 (Setns backoff)
each holding runtime.LockOSThread()
The deferred restore-netns Setns kept failing with EPERM under
nsTest churn at 250 ms cadence. The previous code accepted this:
counted the error, kept the goroutine going, then UnlockOSThread'd
the *tainted* M (now in a deleted netns) back to Go's scheduler.
The runtime tried to reuse it, hit the wrong-netns mismatch on
the next syscall, and was forced to spin up a fresh M every time
— growing the M-pool past the SetMaxThreads(2000) ceiling.
Fix: make UnlockOSThread conditional on the restore Setns
succeeding. On EPERM we skip the unlock — the goroutine exits
while still holding the lock, and the Go runtime terminates the
OS thread (documented runtime.LockOSThread behaviour) instead of
recycling a tainted M.
Cost: one OS-thread creation per failed restore (~10 µs). At
4 namespace events/sec for 1 h that's ~14 k thread creations totalling
~140 ms of overhead. Compared with the prior unbounded accumulation
that ended in a crash, the trade is obvious.
Other observability landings in this commit that supported the
diagnosis:
- nix/microvms/mkVm.nix: open the s3parquet/MinIO/Pyroscope ports
in networking.firewall.allowedTCPPorts so QEMU usermode hostfwd
packets actually reach the listeners (the previous firewall block
only enumerated tcp-stress + clickpipe). curl/browser from the
host can now hit pyroscope :14040, MinIO :9000/:9001, and xtcp2
/metrics + /debug/pprof on :9088.
- cmd/xtcp2: register net/http/pprof side-effect import so
/debug/pprof/{goroutine,heap,…} is available on the prom port
without standing up a separate debug server. Used to capture
the goroutine stack distribution that pointed at the leak.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous commit fixed the M-recycling symptom; this commit fixes
the root cause. Pyroscope diagnostics from the validation run
exposed the actual signal:
restoreNs error: 12116 (100% of 12,116 attempts failed)
restoreNs count: 0
Every single setns(CLONE_NEWNET) restore was failing with EPERM.
Decoding xtcp2's init-time capability dump (Effective = 0x1003000)
confirmed why: the service only had CAP_NET_ADMIN + CAP_NET_RAW +
CAP_SYS_RESOURCE. setns(CLONE_NEWNET) requires CAP_SYS_ADMIN in
the target netns's userns; without it, both the initial setns
into a new ns AND the restore back to the original ns fail.
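The capability decode can be reproduced with a few lines of Go. This is a sketch that covers only the capabilities discussed in this PR; the bit positions come from linux/capability.h, and `decodeCaps` is an illustrative helper rather than xtcp2 code.

```go
package main

import "fmt"

// Bit positions from linux/capability.h for the capabilities discussed here.
var capBits = []struct {
	name string
	bit  uint
}{
	{"CAP_NET_ADMIN", 12},
	{"CAP_NET_RAW", 13},
	{"CAP_SYS_CHROOT", 18},
	{"CAP_SYS_ADMIN", 21},
	{"CAP_SYS_RESOURCE", 24},
}

// decodeCaps returns the names of the listed capabilities set in mask.
func decodeCaps(mask uint64) []string {
	var out []string
	for _, c := range capBits {
		if mask&(1<<c.bit) != 0 {
			out = append(out, c.name)
		}
	}
	return out
}

func main() {
	fmt.Println(decodeCaps(0x1003000))
	// prints: [CAP_NET_ADMIN CAP_NET_RAW CAP_SYS_RESOURCE]
}
```

Bit 21 (CAP_SYS_ADMIN, 0x200000) is absent from 0x1003000, which is the missing capability the commit identifies.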
The retry loop in openAndSetNSWithRetries spun all 10 attempts
under EPERM holding a LockOSThread'd OS thread; the previous
unconditional defer UnlockOSThread (now conditional) handed the
tainted M back to the scheduler; thread count grew without
bound; SetMaxThreads(2000) ceiling crashed the daemon at 1h 45min
under nsTest's 4-evts/sec churn.
clickhouse-pipeline runs survive 12+ h because clickpipe doesn't
run nsTest churn — its namespace surface is whatever docker creates
(handful of containers, minutes between events). Soak + s3parquet-long
both run nsTest at 250 ms cadence and hit the wall.
Granting CAP_SYS_ADMIN means setns succeeds on the first attempt,
restore succeeds, M is properly recycled by the runtime, thread
count stays bounded by the active-namespace working set (~50-300
in steady state, not unbounded growth). The conditional UnlockOSThread
from the prior commit remains as defense-in-depth for any future
environment where CAP_SYS_ADMIN is dropped or scoped differently.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Setns pattern
Two layers of defense against re-introducing the OS-thread leak that
crashed the 12 h s3parquet-long soak:
1. **Regression test** `pkg/xtcp/ns_thread_leak_test.go`:
- Uses the existing test seam (now extended with a restoreNsSetns
hook) to force the restore-Setns to return EPERM, mirroring the
production microvm scenario.
- Runs N=400 iterations of the LockOSThread + restore-fails + exit
pattern with `debug.SetMaxThreads(150)` so any leak panics
immediately instead of looking slow.
- Asserts /proc/self/status:Threads delta stays ≤ 80 across the
run. Without the fix the test would either panic on the thread
cap or fail the delta bound. With the fix delta=1 in practice.
2. **Forbidigo linter rule** in .golangci.yml:
- Bans bare `runtime.UnlockOSThread`. Callers must opt in with
`//nolint:forbidigo // <reason>` documenting why the unlock is
safe in that context. Forces the next person who writes a
LockOSThread/Setns pairing to confront the bug class at the
line they're writing.
- The rule immediately caught a SECOND occurrence in
`pkg/xtcp/ns_watch.go::createNetworkNamespace` — same bug,
same fix (conditional unlock inside the restore defer).
- All legitimate uses (io_uring SQ-thread pinning, CPU-pin in
bench tests) annotated with nolint + justification.
Together: the linter catches the static pattern at write time; the
regression test catches the runtime behaviour if someone bypasses
the linter. Either alone would be incomplete; together they cover
both the "removed conditional" and "added unconditional" regression
shapes.
Includes:
- Restore-Setns seam (`restoreNsSetns` var) in ns_net_namespace.go
so tests can force the restore-failure code path without needing
real CAP_SYS_ADMIN or live namespaces.
- gofmt + goimports drift fixes in cmd/xtcp2 / xtcp2_test.go /
udp_receiver_server_test.go that surfaced when the lint became
stricter.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…k-fail microvm flavor
When the 12 h soak crashed with `fatal error: thread exhaustion`,
the missing CAP_SYS_ADMIN was the proximate cause but the user had
to bisect across the runtime to find it. This commit makes that
class of misconfiguration loud at startup instead of hours-later
under stress.
Go side (pkg/xtcp/init_capabilities.go):
- Replace the legacy CAP_NET_ADMIN + CAP_SYS_CHROOT check (the
chroot one was never actually used) with a structured
requiredCaps table:
CAP_NET_ADMIN fatal — no netlink inet_diag without it
CAP_SYS_ADMIN fatal — setns(CLONE_NEWNET) needs it
CAP_NET_RAW warn — raw-socket destinations fail
CAP_SYS_RESOURCE warn — io_uring rings get bounded
- Print a per-cap diagnostic at startup. On any fatal-tier
capability missing, exit cleanly via x.fatalf() with a multi-line
message that names each missing cap, explains the failure mode,
AND emits a ready-to-paste systemd snippet so the operator can
fix the config in one copy. Soft-required caps surface as
warnings; daemon continues.
- pkg/xtcp/init.go: promote checkCapabilities from log-only to
fatal-exit. Hard-required missing caps refuse to start the daemon
rather than letting it limp and crash later.
Tests (pkg/xtcp/init_capabilities_test.go):
- Rewrite over the new requiredCaps table.
- New cases:
hasAllRequired, hasEverything   happy paths
missingNetAdmin                 fatal diagnostic
missingSysAdmin                 fatal diagnostic (the original 12 h soak bug)
missingOnlySoftCaps             warnings + nil err
missingBothHardCaps             both named in err
capgetErr                       error wrapping
- Each fatal-path assertion pins on the expected substring
(capability name + remediation hint) so a regression in the
message would surface in CI.
Microvm wiring:
- nix/modules/xtcp2-service.nix gains a `capabilities` option
(defaults to the full set). The systemd unit uses it for both
AmbientCapabilities and CapabilityBoundingSet so test flavors
can drop one to validate the fail-early path.
- mkVm.nix adds sink="capcheck-fail": same s3parquet-long config,
but `services.xtcp2.capabilities` deliberately omits
CAP_SYS_ADMIN. xtcp2.service then refuses to start; systemd
prints the diagnostic to the serial console on each restart
attempt.
- Exposed as flake package microvm-x86_64-capcheck-fail.
Verified end-to-end: booting microvm-x86_64-capcheck-fail shows
the expected diagnostic on the serial transcript, and xtcp2.service
enters a Restart=on-failure loop instead of the silent thread-leak
behaviour it had before.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
User feedback: the microvm-x86_64-capcheck-fail flavor is overkill
for what we're verifying — does the daemon exit with a clear
per-capability diagnostic when CAP_SYS_ADMIN is missing? A normal
`nix runCommand` derivation can spawn xtcp2 in the build sandbox
(which runs as an unprivileged user with no elevated caps) and
assert the same end-to-end behaviour in under a second.
New checks (auto-added to `nix flake check`):
- capability-check-no-caps names CAP_NET_ADMIN
- capability-check-names-sys-admin names CAP_SYS_ADMIN
Both spawn xtcp2 with -dest null -maxLoops 1 in the Nix sandbox.
Without any privileged caps the startup checkCapabilities path
fires, the daemon fatal-exits, and the test asserts the stderr
contains the missing-cap name plus the systemd remediation snippet
("AmbientCapabilities", "CapabilityBoundingSet"). The pinned
substrings would surface any future weakening of the diagnostic
in CI.
The microvm-x86_64-capcheck-fail flavor stays for full-stack
validation (systemd ambient-cap config → xtcp2 → restart loop) but
is no longer the routine check.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… pipeline
The prior 12 h validation proved the OS-thread leak is fixed (drift
277→317 over 11 h, vs the previous unbounded growth that crashed at
1 h 45 min). But it ended with "FAIL: no parquet files landed" because
most nsTest-churned namespaces are socket-empty, so xtcp2's
per-namespace netlink poll returned nothing and the parquet writer had
nothing to batch. The leak got fixed but the workload never stressed
the codepath that broke.
Three knobs to put genuine pressure on the same path:
1. soakInitialNs: 50 → 200 (4× concurrent namespace working set)
2. soakChurnSleep: 250 ms → 100 ms (2.5× ns event rate)
3. new xtcp2-soak-ns-traffic systemd unit (the big one)
(3) is a small shell driver that continuously scans /run/netns/ and,
for every nsX it finds, fires `ip netns exec <ns>` with a brief
loopback ncat listener+connector pair INSIDE the namespace. The pair
lives ~50 ms before the listener exits, long enough for xtcp2's next
per-namespace netlink poll to catch the ESTABLISHED state, plus the
subsequent TIME_WAIT. A cap of 30 in-flight injectors limits host fork
pressure even with soakInitialNs=200.
Net effect on the workload (vs prior run):
- ns event rate: 4 evts/sec → 10+ evts/sec
- in-flight namespaces: ~50 → ~200
- envelopeRows/12h: ~73 → expected many thousands
- finalized parquet files/12h: 0 → expected ≥10
If the leak fix still holds under this load, and the parquet pipeline
survives sustained envelope production for 12 h, the bug class is
genuinely closed. If anything ELSE breaks (file descriptor limits,
parquet builder memory, MinIO upload backpressure), we catch it here
instead of in a customer deployment.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
First aggressive 12 h soak attempt: the unit was SKIPped at boot with
"Ordering cycle found, skipping xtcp2 soak — in-namespace TCP loopback
injector". My `after = [xtcp2-soak-churn.service ...]` formed a cycle
with the implicit multi-user.target dep chain. The driver script
already handles `/run/netns/` being empty (sleeps 0.5 s and re-checks),
so the dep was decorative; drop it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The aggressive 12 h soak ran flat files=0 despite the new soak knobs
(200 ns, 100 ms churn, ns-traffic systemd service). Root cause:
the previous shell-based injector lost the race between
`ls /run/netns/` and `ip netns exec` — the ns was gone by the time
exec ran ("Cannot open network namespace nsXX"). Bumping
concurrency didn't help; the script's own bash interpreter wasn't
even in PATH ("exec of bash failed").
Cleaner fix: have nsTest itself open the loopback connection
immediately after `ip netns add`, in-process. No race possible
(we hold the ns reference) and no PATH issues.
Implementation:
- New -traffic flag (default false).
- After each `ip netns add`, on a LockOSThread'd goroutine:
1. snapshot origNs
2. setns into the new ns
3. `ip link set lo up` (shell — one-shot at ns-creation time,
latency immaterial at ~10 creates/sec)
4. open net.Listen on 127.0.0.1:0 + net.Dial back to it +
exchange one payload + close
5. setns back to origNs; conditional UnlockOSThread (same
pattern as the netNamespaceInstance fix — on Setns restore
failure leave the lock held so the runtime terminates the
OS thread instead of recycling a tainted M)
- Each TCP exchange leaves a TIME_WAIT pair in the ns's kernel
socket table for ~60 s; the ns lives ~20 s under the soak's
100 ms churn cadence so xtcp2 sees socket state on every poll.
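Step 4 of the list above (listen on 127.0.0.1:0, dial back, exchange one payload, close) can be sketched on its own, without the namespace switching. `loopbackExchange` is an illustrative name; the real nsTest code runs this after the setns and may structure it differently.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// loopbackExchange listens on an ephemeral loopback port, dials it, sends one
// payload from the dialer to the listener, and closes both ends. It returns
// the bytes the listener side received.
func loopbackExchange(payload []byte) ([]byte, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	type result struct {
		data []byte
		err  error
	}
	got := make(chan result, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			got <- result{nil, err}
			return
		}
		defer c.Close()
		data, err := io.ReadAll(c) // reads until the dialer closes
		got <- result{data, err}
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	if _, err := c.Write(payload); err != nil {
		c.Close()
		return nil, err
	}
	c.Close() // the dialer closes first, ending the listener's read
	r := <-got
	return r.data, r.err
}

func main() {
	data, err := loopbackExchange([]byte("ping"))
	fmt.Printf("%q %v\n", data, err)
}
```

Port 0 lets the kernel pick a free port, so concurrent exchanges never collide on a fixed port number.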
Wiring:
- soakChurnScript now passes -traffic to nsTest.
- The old shell-based xtcp2-soak-ns-traffic systemd unit is left
guarded behind `lib.mkIf false` — not deleted yet so future
reference debugging can compare approaches.
Sanity: 5 min smoke with -traffic produced
Netlinker 2 packets:8, n:192, p:2, fd:... ns:/run/netns/ns86
vs the prior empty
Netlinker N packets:Y, n:20, p:0, ...
Files still 0 at 5 min because the 63 MiB flush threshold needs
more accumulated envelope bytes — addressed by the upcoming 12 h
soak which has the runtime to hit it.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d io profiles
The prior -traffic mode opened one brief loopback exchange per ns
(leaving a TIME_WAIT pair visible to inet_diag for ~60s). That
worked but gave xtcp2 only ~2 sockets per ns with identical
TCP_INFO. -conns 100 instead opens 100 listener+dialer pairs per
ns and keeps them alive for the ns's lifetime; each conn picks a
profile from the cross product of 5 payload sizes (16 B / 256 B /
4 KB / 16 KB / 64 KB) × 4 send intervals (1 / 10 / 100 / 500 ms),
so the TCPInfo spread across 200 conns per ns is real.
Lifecycle:
- startPersistentTraffic spawns a setup goroutine on a LockOSThread'd
thread: setns into the new ns, `ip link set lo up`, open all N
listener+dialer pairs, setns back, conditional UnlockOSThread
(same pattern as netNamespaceInstance — on Setns restore failure
keep the lock held so the runtime terminates the OS thread).
- Once the sockets are open the io goroutines do NOT need to be on
the LockOSThread'd thread; the sockets carry their netns identity.
So 2N io workers per ns × 200 ns = 40k goroutines at N=100, but only
~200 OS threads tied to ns work.
- stopPersistentTraffic is called immediately before `ip netns del`
in the churn loop: cancels the ns ctx, closes all sockets,
bounded 2 s drain wait. Clean shutdown means no EBADF/EPIPE
noise in the journal during normal churn.
- Per-ns state lives in a sync.Map keyed by ns name.
Wiring:
- soakConnsPerNs = 100 added to mkVm.nix.
- soakChurnScript invokes nsTest -conns ${soakConnsPerNs} (replaces
the -traffic flag for the long-soak flavor; -traffic itself is
kept for backward compat with shorter smoke flows that only need
the one-shot TIME_WAIT injection).
5 min smoke under init-burst saw ~20k near-simultaneous connect()
calls overwhelm the loopback path (dial timeouts on ~30% of
init-fill conns). 2 s dial timeout + silent skip-on-fail handles
the noise — by t=5 min the system is stable and producing files.
1 h test PASS: 10 parquet files / 52.9 MB, 0 panics, 0 restarts,
threads stable at 1034. 6× the per-hour parquet throughput of
the previous 12 h soak (which only managed 17 files in 12 h with
the brief-injection -traffic mode).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… side
Mixed flavor that runs the existing clickpipe stack (redpanda +
clickhouse + grafana + prometheus, docker) PLUS in-VM MinIO + a
second xtcp2 instance writing parquet directly to MinIO. Validates
the "operator wants both wire formats out of one host" deployment
shape and exercises ClickHouse's s3() table function against the
parquet objects xtcp2 produces.
What's wired:
- sink = "clickhouse-pipeline-parquet" → mkOneClickPipeParquet.
- isAnyClickPipe / isAnyS3Parquet convenience predicates so shared
infra (docker volume, port forwards, firewall, prom/grafana,
clickpipe-up unit, MinIO bucket bootstrap) lights up for both
flavors via one gate change each instead of N matches.
- New `systemd.services.xtcp2-parquet` unit, scoped to isClickPipeParquet:
runs `${xtcp2Package}/bin/xtcp2` with xtcp2ClickPipeParquetArgs:
-dest s3parquet:http://127.0.0.1:9000
-s3Bucket xtcp2-records
-s3ParquetFlushBytes 4194304 (4 MiB; gives turnover within
a 30 min smoke run)
-promListen :9089 -grpcPort 8890 (off the primary's :9088/:8889)
Same caps as the primary xtcp2 (CAP_NET_ADMIN + CAP_NET_RAW +
CAP_SYS_RESOURCE + CAP_SYS_ADMIN).
- Primary xtcp2 (kafka path, xtcp2ClickPipeArgs) runs unchanged.
ClickHouse container gets `--add-host host.docker.internal:host-gateway`
on its docker run so the s3() function can reach the in-VM MinIO at
http://host.docker.internal:9000 from inside the bridge network.
The mapping is a no-op for plain clickpipe runs that don't use s3().
self-test.nix gains a new optional `runClickhouseParquetCheck` param:
- Check 15: `SELECT count() FROM s3('http://host.docker.internal:9000/
xtcp2-records/**/*.parquet', '…', '…', 'Parquet')` via the
clickhouse container. Polls up to 90s for the first parquet
object to land (4 MiB threshold).
- Emits XTCP2_SELF_TEST_CLICKHOUSE_PARQUET_{PASS,FAIL}.
Exposed at the flake level as:
- packages.microvm-x86_64-clickhouse-pipeline-parquet
- apps.microvm-x86_64-clickhouse-pipeline-parquet (boots the VM directly,
same pattern as the plain clickhouse-pipeline app).
Next: short hand-driven boot to verify both xtcp2 instances start cleanly
and ClickHouse can resolve host.docker.internal, then wire into a
proper lifecycle test.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ower flush for smoke
Two follow-ups after the first boot of the clickhouse-pipeline-parquet
flavor:
1. The second xtcp2 instance was bound on :9089 / :8890 inside the VM
but the host couldn't reach it because the QEMU hostfwd table only
listed :9088 / :8889 (the primary instance's ports). Added matching
forwardPorts entries + firewall openings under
`lib.optionals isClickPipeParquet`. Operators can now hit
http://127.0.0.1:9089/metrics for the parquet pipeline's prom
counters side-by-side with :9088 for the kafka pipeline.
2. Dropped xtcp2ClickPipeParquetArgs's -s3ParquetFlushBytes from 4 MiB
to 256 KiB. The mixed flavor exists primarily to validate the
kafka + parquet + ClickHouse-reading-parquet plumbing in a short
smoke; 256 KiB flushes within ~30 s of boot and gives the self-test
check immediate signal. Production deployments using the same
pattern should set this to the 63 MiB default by editing the flavor.
End-to-end verified: ClickHouse's s3() table function reading from
host.docker.internal:9000 (the in-VM MinIO via the bridge gateway alias
added in the previous commit) now returns row counts from the
xtcp2-written parquet objects. 600 rows in one parquet file at +90 s,
alongside 72 rows in the kafka path (still ramping up the clickhouse
kafka-engine consumer).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…+ 8 GB ClickHouse
First 2 h mixed-flavor soak hit ClickHouse's container-level memory cap
(3500m default from the clickpipe flavor): 222 MEMORY_LIMIT_EXCEEDED
errors over the run, blocking the kafka_engine MV. The parquet pipeline
was unaffected (it writes through MinIO, not through ClickHouse) but
the goal of the mixed flavor is to validate BOTH paths in one VM, so
the kafka path needs room to operate.
Two coupled changes:
- constants.nix: new `memClickPipeParquet = 12288` (vs 6144 for plain
clickpipe). Headroom for: ClickHouse (~5 GiB peak under the mixed
load), Redpanda (~700 MiB), MinIO (~300 MiB growing), 2× xtcp2
instances (~500 MiB each), dockerd, page cache, kernel.
- mkVm.nix: new `clickPipeClickhouseMemory` let-binding picks the
container --memory= based on isClickPipeParquet: 8000m for the mixed
flavor, 3500m for plain (unchanged, keeps the 12 h-validated budget).
Wired into the docker run.
The 12 GiB VM is non-trivial; the plain clickhouse-pipeline flavor
keeps its 6 GiB budget so existing soak runs aren't perturbed. Only the
mixed flavor takes the larger footprint, and it's the same order as a
typical operator running clickpipe + parquet on one box.
Next: re-run the 2h mixed soak with the bumped budgets and confirm
kafka_engine MV catches up to xtcp2's produce rate alongside the
parquet pipeline.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…me OOMs
Root cause of the persistent MEMORY_LIMIT_EXCEEDED storm in the mixed
flavor was NOT MV / parts-merge memory pressure but ClickHouse's OWN
observability tables. Under the mixed workload (2× xtcp2 + nsTest
churn + kafka_engine + s3 reads) the periodic flushes into
system.latency_log, metric_log, asynchronous_metric_log,
processors_profile_log accumulate fast, then their background merges
trip the per-server max-memory cap before the user kafka MV gets a
chance. Bumping memory just raised the cap; the workload kept up.
With config.d/disable_chatty_logs.xml mounted into the container,
MEMORY_LIMIT_EXCEEDED dropped from 903 to ~28 over the same 15 min
smoke window and xtcp.xtcp_flat_records started ingesting again (the
parquet path was always fine: the s3() roundtrip returns ~22 k rows).
Keep memClickPipeParquet=16384 / --memory=12000m as cheap insurance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ClickHouse's kafka_engine defaults `kafka_max_block_size` and
`kafka_poll_max_batch_size` to `max_block_size` (65,505). With our
ProtobufList wire format, where each kafka message is an `Envelope`
that expands into ~100–1000 `XtcpFlatRecord` rows, a single poll cycle
wants to materialize 6.5M–65M rows in memory before flushing. That's
what `StorageKafka::threadFunc` was OOMing on in the mixed
clickhouse-pipeline-parquet flavor (~2500 sockets fattening envelopes).
After the chatty-logs disable last commit, the remaining OOMs were all
on this path.
Capping to 256 messages/poll bounds the working set at ~256 ×
avg-envelope-size rows; the MV still flushes 64K-row blocks to the
MergeTree, just one block at a time. Verified via `SHOW CREATE TABLE`
on the live consumer and via err.log: `StorageKafka` no longer appears
in the OOM stack traces.
Doesn't (yet) fix the deeper MV-halt symptom: the consumer still hits
intermittent ProtobufList BAD_ARGUMENTS errors when the proto file is
briefly unavailable during the docker entrypoint's chown of
/var/lib/clickhouse/format_schemas/. Tracking as a follow-up: the
schema-race needs either a startup barrier or
kafka_skip_broken_messages turned up so individual schema failures
don't halt the consumer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
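The row-count arithmetic behind the batch cap, worked in Go. The helper is illustrative; the figures are the ones quoted in this commit message (65,505 or 256 messages per poll, ~100 to ~1000 rows per Envelope).

```go
package main

import "fmt"

// rowsPerPoll is the number of rows one poll cycle may materialize when every
// kafka message expands into rowsPerEnvelope rows.
func rowsPerPoll(messages, rowsPerEnvelope int) int {
	return messages * rowsPerEnvelope
}

func main() {
	for _, batch := range []int{65505, 256} {
		fmt.Printf("batch=%d: %d to %d rows per poll\n",
			batch, rowsPerPoll(batch, 100), rowsPerPoll(batch, 1000))
	}
	// prints:
	// batch=65505: 6550500 to 65505000 rows per poll
	// batch=256: 25600 to 256000 rows per poll
}
```

At 256 messages per poll the worst case is about 256 k rows, roughly 256 times smaller than the default.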
…_size
Two related changes to reduce the OOM pressure and stabilize the
kafka_engine MV in the mixed clickhouse-pipeline-parquet flavor:
* In `clickpipe-up`, after ClickHouse accepts queries, add a
ProtobufList schema-warm probe
(`SELECT * FROM xtcp.xtcp_flat_records LIMIT 0 FORMAT ProtobufList SETTINGS format_schema=...`).
LIMIT 0 produces no rows but ClickHouse still constructs the
ProtobufList output format object, which opens the proto file and
resolves the message type. Forces the file to be in its final state
(post entrypoint chown) before xtcp2 starts producing.
* Lower `kafka_poll_max_batch_size` 256 → 64. With 256 the consumer
drained the kafka backlog as fast as it could on first poll, overran
the MergeTree's merge throughput, and the resulting parts-merge
memory pressure OOM'd the consumer's next allocation. 64 smooths the
insert rate enough that merges keep up.
Combined effect at T+5m of a fresh boot:
- chatty-logs only baseline: ch_rows=2584 OOMs=13
- + batch=256 (first attempt): ch_rows=7448 OOMs=826 (cascade)
- + batch=64 + schema-warm: ch_rows=4871 OOMs=11
OOMs are back to roughly a dozen per 5 min interval, level with the
chatty-logs-only baseline and far below the batch=256 cascade.
Doesn't fully fix the kafka MV halt: the kafka_engine consumer still
hits a `BAD_ARGUMENTS: Could not find a message named ...` on its
SECOND poll batch (~1 min after producer starts). The schema-warm above
proves the schema is loadable for SELECT...FORMAT, but the kafka_engine
rebuilds its pipeline each flush_interval (5s) and re-loads the schema
independently, and that re-load occasionally fails. Next step (separate
fix) is either kafka_skip_broken_messages > 0 so a transient
schema-lookup failure isn't terminal, or a longer-living schema cache.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The SELECT...FORMAT ProtobufList probe goes through a different schema loader than kafka_engine — its source-tree importer reports 'CANNOT_PARSE_PROTOBUF_SCHEMA: File not found' for the same file that kafka_engine successfully parses moments later. The probe was failing every boot for the full 30s window, clickpipe-up.service exited with FATAL, and xtcp2 started anyway because `After=` is permissive. So the OOM improvements that landed in 72b2dd2 are entirely from kafka_poll_max_batch_size=64 — keep that. The probe code was dead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After capturing the full ClickHouse server log on a fresh boot, the schema errors and the apparent MV halt have a much simpler explanation than what the prior commit messages (8db5dbd, 72b2dd2) claimed:
1. The "Could not find a message named ..." errors in system.kafka_consumers are NOT a ClickHouse-25.3 ProtobufList cache bug, and they're NOT a recurring runtime issue. They come from the official docker entrypoint's pattern of running a temporary 127.0.0.1-only server to execute the initdb scripts, then SIGTERMing it before starting the real server. Our kafka_engine table attaches in that temp server, the consumer thread loads the schema during shutdown, fails with BAD_ARGUMENTS, and the failure entry sticks around in system.kafka_consumers.exceptions (capped at 10 entries). The consumer in the second/real server starts clean and runs fine. You can see two `Application: Starting ClickHouse` events in clickhouse-server.log, ~3 s apart, every boot.
2. The "MV halts at N rows" symptom across the 30-min probe windows wasn't a halt. `Pushing N rows ... took 37152 ms` / 146775 ms entries in the log show individual kafka_engine flushes are taking 30-150 s each under the mixed flavor's ingest rate. ch_rows incrementing by ~2.4 k every 30 min IS the consumer running normally, just slowly. last_poll_time stays current.
The code changes from those commits are still correct: the OOM mitigations (kafka_poll_max_batch_size=64, chatty-logs disable) really do reduce MEMORY_LIMIT_EXCEEDED pressure end-to-end. But the rationale attached to 72b2dd2 about kafka_engine reloading the schema per flush_interval is wrong. Remove the bogus claim from the SQL comment and document the actual root cause in docs/integration-testing.md so the next person investigating doesn't go down the same rabbit hole.
The remaining open question, why each MV flush is so slow (a 122-column ZSTD MergeTree insert of a few k rows takes tens of seconds), is a real follow-up worth profiling, but it's perf, not correctness.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two coordinated changes that together unlock substantially higher MV
throughput in the mixed flavor.
Container memory: 12000m → 14000m
ClickHouse's internal max-memory cap is 90 % of the container
limit (max_server_memory_usage_to_ram_ratio = 0.9, with 12000m read as MiB). At 12000m the cap was 10.55 GiB and CH's baseline
MemoryTracking parked at 10.45 GiB constantly. The kafka_engine's
per-batch 131 MiB protobuf decode buffer allocation was rejected
~2 %/min — those messages routed through kafka_handle_error_mode='stream'
to errors_mv and the consumer lost them. Bumping to 14000m raises
the cap to 12.30 GiB.
kafka_poll_max_batch_size: 64 → 16
Bumping the cap alone did NOT help — CH grew to fill the new
headroom (MemoryTracking 10.45 → 12.11 GiB) and the same 131 MiB
allocation still occasionally hit the new cap. WORSE, with more
per-batch memory in flight the per-push processing time during a
rejected allocation exceeded max.poll.interval.ms (5 min default),
the consumer got kicked from the kafka group, rejoined, and
re-read the same batch from the last committed offset → rebalance
death loop (offset frozen for the entire hour I left it running).
batch_size=16 keeps the per-poll buffer at ~33 MiB instead of
~131 MiB, and shortens the per-push processing time enough that
even under memory pressure the consumer stays inside the
poll-interval window. No more rebalance kicks.
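The figures in this message can be reproduced with a short Go sketch. It rests on two assumptions that are not spelled out in the commit: that docker's `m` suffix means MiB, and that the per-poll decode buffer scales linearly with `kafka_poll_max_batch_size` (131 MiB observed at batch=64). The 0.9 ratio is ClickHouse's default `max_server_memory_usage_to_ram_ratio`; the helper names are hypothetical.

```go
package main

import "fmt"

const mibPerGiB = 1024.0

// serverCapGiB converts a container limit in MiB into the server memory cap
// in GiB, given the ratio of the limit that the server is allowed to use.
func serverCapGiB(containerMiB, ratio float64) float64 {
	return containerMiB * ratio / mibPerGiB
}

// pollBufferMiB extrapolates the per-poll decode buffer from the 131 MiB
// observed at batch=64. Linear scaling is an assumption of this sketch.
func pollBufferMiB(batch int) float64 {
	const observedMiB, observedBatch = 131.0, 64.0
	return observedMiB * float64(batch) / observedBatch
}

func main() {
	for _, limit := range []float64{12000, 14000} {
		fmt.Printf("%.0fm -> cap %.2f GiB\n", limit, serverCapGiB(limit, 0.9))
	}
	fmt.Printf("batch=16 -> ~%.0f MiB per poll\n", pollBufferMiB(16))
}
```

Under those assumptions the sketch yields caps of 10.55 GiB and 12.30 GiB and a per-poll buffer of about 33 MiB, matching the numbers quoted above.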
Measured at T+31m of a fresh smoke (compared to the prior 8h soak
baseline of 12000m / batch=64 over 480 min):
                 8h soak baseline    This config (31m)
ch_rows          12 237              9 877
total OOMs       1 383               67
rows / minute    25                  319  (12.8× faster)
rows per OOM     8.8                 147  (16.8× more efficient)
The OOM rate per minute (~2.2/min, versus ~2.9/min for the baseline) is
in the same range, but each OOM costs far fewer rows because the consumer
recovers quickly and the in-flight batch is smaller.
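The derived rows in the table follow directly from the raw counts. A small Go sketch, with a hypothetical `soak` type holding the ch_rows, OOM totals and run lengths reported above, recomputes them so a later soak can be compared the same way.

```go
package main

import "fmt"

// soak holds the raw counters from one run (hypothetical type for this sketch).
type soak struct {
	name    string
	rows    float64
	ooms    float64
	minutes float64
}

func (s soak) rowsPerMinute() float64 { return s.rows / s.minutes }
func (s soak) rowsPerOOM() float64    { return s.rows / s.ooms }
func (s soak) oomsPerMinute() float64 { return s.ooms / s.minutes }

func main() {
	runs := []soak{
		{"12000m / batch=64 (8h soak)", 12237, 1383, 480},
		{"14000m / batch=16 (31m smoke)", 9877, 67, 31},
	}
	for _, r := range runs {
		fmt.Printf("%-30s rows/min=%.0f rows/OOM=%.1f OOMs/min=%.1f\n",
			r.name, r.rowsPerMinute(), r.rowsPerOOM(), r.oomsPerMinute())
	}
}
```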
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captured from the 20 GB / 28 GB container bumps we tried after the 14 GB / batch=16 validated config. Two non-obvious findings:
1. ClickHouse's MemoryTracking grows to fill whatever per-server cap the container limit implies. The kafka_engine 131 MiB batch alloc keeps tipping the tracker over the cap at the same workload-driven rate (~2.3/min) regardless of how high the cap is set.
2. Past ~20 GB container, per-flush MV insert time grows sharply (8 rows / 37 s at 12 GB → 8 rows / 197 s at 28 GB). That blows past max.poll.interval.ms, the consumer is kicked, and ch_rows freezes in a rebalance death loop: a net REGRESSION.
The proper fix for the residual OOMs is to cap ClickHouse's discretionary caches via config.d so the tracker stops growing into the cap. That's a separate change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two coordinated changes to bound discretionary memory use in the mixed clickhouse-pipeline-parquet flavor.
ClickHouse config.d/limit_memory.xml:
* mark_cache_size 5 GiB → 256 MiB
* index_mark_cache_size 5 GiB → 128 MiB
* uncompressed_cache_size already 0; explicit
* index_uncompressed_cache_size 0
* compiled_expression_cache_size 128 MiB (unchanged)
* leave max_server_memory_usage_to_ram_ratio at default 0.9
Our working set is tiny (~55 MiB of MergeTree data); the 5 GiB default mark cache was hilariously oversized for the workload.
Redpanda: --memory=1G --reserve-memory=0M
Was unbounded under --mode=dev-container. Bounded to 1 GiB; observed RSS now 255 MiB. Frees ~700 MiB of host RAM previously over-reserved.
Measured at T+31m of a fresh smoke vs the 14000m / batch=16 baseline from commit f6f9a86:
                        baseline     this config
ch_rows / T+31m         9 877        12 167    (+23 %)
total OOMs              67           68        (no change)
CH container RSS        9.5 GiB      6.0 GiB   (-37 %)
MemoryTracking (idle)   12.11 GiB    1.29 GiB
errors_mv rows          67           68
The OOM RATE is unchanged because the OOMs come from peak kafka_engine batch processing (a transient 10+ GiB allocation across decode buffer + column buffers + compression buffers), not from the persistent caches. The caches were the steady-state memory consumer; capping them frees the budget for the transient peaks and gives better throughput, but doesn't eliminate the per-batch peak hitting the cap.
A real zero-OOM fix would require reducing the per-batch peak allocation itself (smaller kafka_poll_max_batch_size, fewer columns in the MV, or a custom kafka_engine config). Out of scope here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops `kafka_max_block_size` from 65,536 → 1,024 rows and
`kafka_flush_interval_ms` from 5000 → 2000 ms.
Diagnosis (credit to dave):
Since the migration to the ProtobufList wire format, each kafka
message is already an Envelope containing ~100-1000 XtcpFlatRecord
rows. The kafka_engine's own row-level Block accumulator (default
65,505 rows) sits on top of that batching — it accumulates rows
from many ProtobufList messages before flushing through the MV.
ClickHouse pre-allocates per-column buffers sized for the FULL
Block capacity at flush time. With 122 columns × 65K rows worth
of pre-allocated buffer + ZSTD/LZ4 compression contexts + MV
pipeline state, MemoryTracking parked at ~10 GiB and the 131 MiB
chunk allocations occasionally tipped the per-server memory cap.
None of that memory was data — our actual workload is ~430 rows/sec
≈ 215 KB/sec on the wire.
Setting block_size to ~1 envelope (1024 rows) makes the kafka_engine
effectively pass each ProtobufList through to the MV without
redundant accumulation. Per-flush column buffers are 64× smaller.
Measured before/after on a fresh boot of the mixed flavor:
                          block=65536 / flush=5s      block=1024 / flush=2s
MemoryTracking (idle)     9.31 GiB                    178 MiB  (53×)
MemoryTracking (peak)     10-12 GiB                   246 MiB  (40×)
MEMORY_LIMIT_EXCEEDED     67 / 31 min                 0
errors_mv rows            68                          0
Throughput                319-393 rows/min            ~27,000 rows/min  (~70×)
Consumer commits / msgs   2 / 426 (rebalance loop)    69 / 69 (1:1)
The throughput now matches xtcp2's actual production rate (~430
rows/sec) — the consumer is running in real-time with no backlog.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Updates the Troubleshooting section to:
* mark the earlier "bumping memory doesn't help" entry as historical
* document the real fix from c52e4e5: kafka_max_block_size = 1024 + kafka_flush_interval_ms = 2000
* explain WHY ProtobufList + the default 65K-row Block was redundant and over-allocated column buffers
* include the before/after measurement table so the next debugger sees what good looks like
* note the regression check (SHOW CREATE TABLE to verify the setting hasn't drifted back to the default)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small fixes so the in-VM Prometheus is useful for long-soak stability tracking:
1. Add a host:19090 → guest:9090 forward (was previously commented out). Lets a host-side scrape or curl reach the in-VM TSDB directly without TTY hops.
2. In the clickhouse-pipeline-parquet mixed flavor, add the second xtcp2 instance on :9089 as a scrape target. Both instances now show up as separate `instance` labels (xtcp2-primary, xtcp2-parquet) so goroutine / memory / GC trends can be compared side-by-side over a 24h soak.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small bash helpers for monitoring a running mixed-flavor microvm via its host-forwarded :19090 Prometheus endpoint:
* clickpipe-prom-probe.sh: one-line per-instance snapshot of go_goroutines, go_memstats_heap_inuse_bytes (MiB), go_threads for both xtcp2-primary and xtcp2-parquet. Used inside the soak monitor loop for periodic probes.
* clickpipe-stability-summary.sh: soak-end report. Queries current/max for goroutines, OS threads, heap, RSS over the soak window, plus total GC pause time. Useful for a "did anything drift?" judgement after a 4-24h run.
The 4h soak passed with these: 6.3M rows ingested, zero OOMs, goroutine drift bounded at +13-18, heap oscillates normally with GC.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
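The helpers themselves are bash and are not reproduced here. As a rough Go sketch of the same idea, the function below decodes the JSON body that Prometheus's `/api/v1/query` endpoint returns for an instant query and keys each sample by its `instance` label. The function name, the struct, and the sample payload (including its values) are illustrative assumptions, not output captured from the soak.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// promResponse models only the fields of an instant-vector reply that this
// sketch reads.
type promResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [unix timestamp, "value as string"]
		} `json:"result"`
	} `json:"data"`
}

// parseInstantVector returns one float per `instance` label.
func parseInstantVector(body []byte) (map[string]float64, error) {
	var resp promResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus status %q", resp.Status)
	}
	out := make(map[string]float64)
	for _, r := range resp.Data.Result {
		if len(r.Value) != 2 {
			return nil, fmt.Errorf("malformed sample for %v", r.Metric)
		}
		s, ok := r.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("sample value is not a string for %v", r.Metric)
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, err
		}
		out[r.Metric["instance"]] = v
	}
	return out, nil
}

// sample is made-up data in the documented response shape.
const sample = `{"status":"success","data":{"resultType":"vector","result":[
 {"metric":{"__name__":"go_goroutines","instance":"xtcp2-primary"},"value":[1700000000.0,"42"]},
 {"metric":{"__name__":"go_goroutines","instance":"xtcp2-parquet"},"value":[1700000000.0,"57"]}]}}`

func main() {
	got, err := parseInstantVector([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(got["xtcp2-primary"], got["xtcp2-parquet"])
}
```

In a real probe the body would come from an HTTP GET against the forwarded :19090 endpoint; the parsing step is kept separate so it can be exercised without a running VM.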
Two fixes uncovered by a 24h soak that crashed at T+21h:
1. Redpanda was unbounded. Its `start --memory=1G` flag is a seastar data-plane reservation, not an OS cgroup limit, so the rest of the process can allocate freely. Over 21h it grew until it triggered the system OOM-killer (`folio_prealloc 12.9 GiB`), which then chose the largest victim (clickhouse-serv at 11.9 GiB RSS) and killed it. The fix is a real docker `--memory=2G` cgroup cap on the redpanda container.
2. `CLICKHOUSE_ALWAYS_RUN_INITDB_SCRIPTS=true` made every container restart re-run the initdb.d scripts, which DROP and recreate xtcp.xtcp_flat_records. So when CH crashed during the soak, docker's `--restart on-failure` brought it back but with zero rows. Removed; initdb now runs only on first-time volume init (when /var/lib/clickhouse is empty). Verified by docker-killing the live container: it comes back via `docker start` with ch_rows intact (19180 before kill → 24044 after, consumer caught up).
Together these mean an OOM-induced or operator-induced CH restart during a 24h soak doesn't lose data, and redpanda can't trigger that OOM in the first place.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…erts
A 24h soak retry just got stuck after 1 h: consumer in rebalance death loop, ch_rows frozen at ~21 k, OOMs climbing despite the kafka_max_block_size=1024 fix.
Root cause: librdkafka's max.poll.interval.ms is 5 min by default, and our MV flush occasionally takes 30-150 s (memory pressure, parts merge, ZSTD on 122 columns). Once that happens during the startup race window when CH is hot, the consumer gets kicked, rejoins at the last committed offset, re-reads the same batch, fails the same way → indefinite loop.
config.d/kafka_client_tuning.xml extends:
* max.poll.interval.ms 5 min → 15 min (900000 ms)
* session.timeout.ms 45 s → 5 min (300000 ms)
* heartbeat.interval.ms explicit 10 s
15 min covers any plausible MV-flush spike. session.timeout.ms stays well below it.
The earlier 4h soak completed cleanly only because it happened to dodge this trap; the 24h soak attempts hit it more reliably because of longer total time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A 24h soak v3 attempt got stuck at ~22k rows after 1h. Consumer was no longer in a rebalance death loop (commits succeeding), but MV inserts had gone pathologically slow: Pushing 2.45k rows took 414 seconds.
system.asynchronous_metrics shows the cause:
jemalloc.retained     18.15 GiB   ← held but unused chunks
jemalloc.allocated    12.35 GiB
MemoryResident         9.44 GiB   ← actual physical RAM
MarkCacheBytes         0 B        ← our caches are capped, fine
ClickHouse's MemoryTracker (12.20 GiB) hits its 12.30 GiB cap because of those retained jemalloc chunks even though actual RSS is just 9.44 GiB. Every new alloc has to wait for the tracker to drop below the cap → slow MV inserts.
MALLOC_CONF=background_thread:true,dirty_decay_ms:1000,muzzy_decay_ms:1000 tells jemalloc to:
* run a background thread that purges unused chunks
* mark dirty pages "muzzy" after 1 s of disuse (default 10 s)
* return muzzy pages to OS after 1 s (default 10 s)
End result: retained chunks return to the OS quickly, MemoryTracker sits well below the cap, MV inserts run at normal speed. This is the standard remedy for long-running ClickHouse instances showing jemalloc.retained bloat.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 24h v4 soak ran 0-22h cleanly with the MALLOC_CONF jemalloc fix,
then collapsed at T+22h because:
* /var/lib/docker on the 8 GiB sparse image was 99 % full at T+22h
(CH parts 2.92 GiB + redpanda log + dockerd overhead = 7.3 GiB)
* /var/lib/minio on the default 512 MiB tmpfs was 100 % full —
the parquet path writes ~10 MiB/min and accumulated 507 MiB
of files over 22 h.
* Throughput collapsed to ~5 % of normal once NOT_ENOUGH_SPACE
started firing on every kafka_engine commit.
Fixes:
* microvm.volumes: docker.img 8192 → 16384 MiB
* microvm.volumes: add a dedicated 16384 MiB MinIO image at
/var/lib/minio (gated on isClickPipeParquet)
* minio-bucket-bootstrap.nix: new `useTmpfs` flag (default true)
so the module skips its tmpfs declaration when the caller is
providing a real disk
xtcp2 itself was bulletproof across the full 24h: goroutines drifted
only +37-43 over 24h, OS threads +34-38, heap oscillated normally
with GC, RSS bounded at 247 MiB peak. The "bulletproof 24h" target
is met by the daemon — these changes just keep the supporting
infrastructure from filling up.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The s3parquet layer's new Go files and the touched microvm nix files weren't formatted to the repo's pinned gofmt/nixfmt; format them so the gofmt and nix-fmt checks pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…leged The s3parquet layer added a fail-early startup capability check (checkCapabilities → x.fatalf) to Init(). NewXTCP / NewNsTestingXTCP both call Init, so any test that constructs an XTCP (pkg/xtcp TestNewXTCP_runsToCompletion, cmd/ns TestRunDaemonDefault_constructs) os.Exit'd the test binary on sandboxes lacking CAP_SYS_ADMIN / CAP_NET_ADMIN — the stack only ran these inside the cap-granting microVM. Indirect the gate through a package var (matching the existing constructorRegistry / netNsCandidateDirs seams) and add SetCapabilityCheck; TestMain in each package installs a no-op. The capability logic itself is still exercised directly, with the real method, in init_capabilities_test.go. Production behaviour is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
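The seam described here is a common Go pattern, and a minimal sketch of it is shown below. Only the name `SetCapabilityCheck` and the idea of a package-level variable come from the commit message; the signatures, the stand-in check, and the fact that this sketch returns an error instead of calling `x.fatalf` are assumptions made so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// capabilityCheck is the package-level seam. Production code leaves it
// pointing at the real check; tests may replace it.
var capabilityCheck = realCapabilityCheck

// realCapabilityCheck is a stand-in for the real gate. It always fails here,
// mimicking a sandbox without CAP_SYS_ADMIN / CAP_NET_ADMIN.
func realCapabilityCheck() error {
	return errors.New("missing CAP_NET_ADMIN")
}

// SetCapabilityCheck installs fn as the gate and returns a function that
// restores the previous one (the return value is an assumption of this sketch).
func SetCapabilityCheck(fn func() error) (restore func()) {
	prev := capabilityCheck
	capabilityCheck = fn
	return func() { capabilityCheck = prev }
}

// Init fails early when the gate reports a problem. The real code exits the
// process; returning an error keeps this sketch testable.
func Init() error {
	if err := capabilityCheck(); err != nil {
		return fmt.Errorf("startup capability check: %w", err)
	}
	return nil
}

func main() {
	fmt.Println("production gate:", Init())

	// What a TestMain would do before m.Run(): install a no-op.
	restore := SetCapabilityCheck(func() error { return nil })
	fmt.Println("with no-op seam:", Init())
	restore()
}
```

Because the production default is never changed, the fail-fast behaviour stays intact outside tests, and the real check can still be called directly in its own test file.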
…dConfig printFlags and buildConfig dereference the s3parquet/pyroscope mainFlags fields the s3parquet layer added, but both test fixtures were never updated to allocate the four pyroscope pointers — so both tests nil-deref panicked. Allocate them like the real defineFlags does. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ine) The test polls for the queueFull counter and breaks the instant it ticks, so a passing run finishes in milliseconds — but the 2s safety deadline was tight enough that a loaded full-suite run (esp. under -race) could trip a false 'counter never ticked' failure. Widen the deadline; the happy path is unaffected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Note for review: golangci-lint findings deferred to #22
None are in the test-seam / fixture code this PR adds. PR #22 drives the whole tree to |
Summary
s3parquet-destination: the final stack layer (32 commits). Adds a direct Parquet → MinIO/S3 destination (retiring the Vector sidecar), plus clickhouse/kafka/MinIO OOM tuning, a fail-early capability check, Pyroscope continuous profiling, and the long-soak/parquet microVM flavors. After this merges, `main` contains the entire advanced stack.

Conflicts resolved
* `cmd/xtcp2/xtcp2.go` / `xtcp2_test.go` / `nix/microvms/mkVm.nix`: pure gofmt/alignment divergence (main vs the stack base had identical fields, different whitespace) + the s3parquet additions. Took the s3parquet version, then re-formatted.

Pre-existing test failures found in the branch tip, fixed here
The s3parquet layer shipped with 3 failing test packages (confirmed at the pristine `s3parquet-destination` tip via a clean worktree; the stack's CI only ran them inside the capability-granting microVM):
* `cmd/xtcp2` TestPrintFlags / TestBuildConfig: nil-deref. The fixtures never allocated the 4 new `pyroscope*` `mainFlags` pointers that `printFlags` / `buildConfig` dereference. → allocate them.
* `pkg/xtcp` / `cmd/ns`: the new fail-early capability check (`Init` → `checkCapabilities` → `x.fatalf`) `os.Exit`'d the test binary on sandboxes lacking `CAP_SYS_ADMIN` / `CAP_NET_ADMIN` (both `NewXTCP` and `NewNsTestingXTCP` call `Init`). → add a `SetCapabilityCheck` test seam (matching the existing `constructorRegistry` / `netNsCandidateDirs` seams) + a `TestMain` in each package that installs a no-op. Production fail-fast is unchanged; the cap logic is still tested directly in `init_capabilities_test.go`.
* `TestS3ParquetDest_corner_queueFull`: flaky. A 2 s safety deadline tripped under full-suite parallel load. → widen to 30 s (happy path unaffected).

Testing
* `go vet ./...` + `gofmt -l .` + repo `nix-fmt` check: clean.
* `go test -ldflags=-checklinkname=0 -tags 'dest_kafka dest_nats dest_nsq dest_valkey dest_s3parquet' ./...`: entire suite green across repeated runs (incl. the now-robust queueFull test).
* `capcheck-fail` microVM flavors: those exercise the capability check + the MinIO/parquet round-trip end-to-end and remain microVM-verified.

🤖 Generated with Claude Code